 contextual word


Uniform Information Density and Syntactic Reduction: Revisiting $\textit{that}$-Mentioning in English Complement Clauses

arXiv.org Artificial Intelligence

Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer $\textit{that}$ in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and $\textit{that}$-mentioning. However, we found that previous measures of information density based on matrix verbs' subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.
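As a rough illustration of how information density is commonly quantified in UID work, the sketch below computes surprisal, $-\log_2 p$, for the onset of a complement clause given the matrix verb. The verbs and probabilities are invented for illustration; they are not estimates from the paper's corpus, and the paper's embedding-based estimates are not reproduced here.

```python
import math


def surprisal(probability: float) -> float:
    """Return surprisal in bits, -log2(p); lower probability means more information."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical subcategorization probabilities P(complement clause | matrix verb).
# These numbers are illustrative assumptions, not values from the paper.
subcat_prob = {"think": 0.5, "confirm": 0.125}

for verb, p in subcat_prob.items():
    print(f"{verb}: {surprisal(p):.2f} bits")
```

Under UID, the higher-surprisal clause onset (here, after the hypothetical "confirm") is where a speaker would be expected to mention $\textit{that}$ more often, spreading the information over more words.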


Extracting domain-specific terms using contextual word embeddings

arXiv.org Artificial Intelligence

Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus RSDO5 for the Slovenian language and devise a set of rules for term candidate selection and then generate statistical, linguistic and context-based features. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state-of-the-art, which proves that contextual word embeddings are valuable for improving term extraction.

1. Introduction

Automated terminology extraction (ATE) refers to the task of extracting meaningful terms from domain-specific texts. Terms are single-word units (SWU) or multi-word units (MWU) of knowledge that are relevant for a particular domain. Since manual identification of terms is costly and time-consuming, ATE approaches can reduce the effort needed to generate relevant domain-specific terms. Recognizing and extracting domain-specific terms, which is useful in various fields, such as translation, dictionary creation, ontology generation and others, remains a difficult task.


Variational Language Concepts for Interpreting Foundation Language Models

arXiv.org Machine Learning

Foundation Language Models (FLMs) such as BERT and its variants have achieved remarkable success in natural language processing. To date, the interpretability of FLMs has primarily relied on the attention weights in their self-attention layers. However, these attention weights only provide word-level interpretations, failing to capture higher-level structures, and are therefore lacking in readability and intuitiveness. To address this challenge, we first provide a formal definition of conceptual interpretation and then propose a variational Bayesian framework, dubbed VAriational Language Concept (VALC), to go beyond word-level interpretations and provide concept-level interpretations. Our theoretical analysis shows that our VALC finds the optimal language concepts to interpret FLM predictions. Empirical results on several real-world datasets show that our method can successfully provide conceptual interpretation for FLMs.


Semantics or spelling? Probing contextual word embeddings with orthographic noise

arXiv.org Artificial Intelligence

Pretrained language model (PLM) hidden states are frequently employed as contextual word embeddings (CWE): high-dimensional representations that encode semantic information given linguistic context. Across many areas of computational linguistics research, similarity between CWEs is interpreted as semantic similarity. However, it remains unclear exactly what information is encoded in PLM hidden states. We investigate this practice by probing PLM representations using minimal orthographic noise. We expect that if CWEs primarily encode semantic information, a single character swap in the input word will not drastically affect the resulting representation, given sufficient linguistic context. Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data, and that this sensitivity is related to subword tokenization: the fewer tokens used to represent a word at input, the more sensitive its corresponding CWE. This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data. We conclude that these PLM-derived CWEs may not be reliable semantic proxies, and that caution is warranted when interpreting representational similarity.
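The probing setup described above can be sketched in two pieces: a minimal orthographic perturbation and a similarity measure between the clean and noisy representations. The sketch assumes that "character swap" means transposing two adjacent characters, which is only one possible reading, and it uses hand-written toy vectors in place of real PLM hidden states.

```python
import math


def swap_adjacent(word: str, i: int) -> str:
    """Swap the characters at positions i and i + 1 (one minimal orthographic edit)."""
    if not 0 <= i < len(word) - 1:
        raise IndexError("i must index a character that has a right neighbour")
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


noisy_word = swap_adjacent("language", 3)  # transposes "g" and "u"

# Toy stand-ins for the embeddings of the clean and noisy word in the same
# sentence; a real experiment would take these from a PLM's hidden states.
clean_vec = [3.0, 4.0]
noisy_vec = [4.0, 3.0]
similarity = cosine_similarity(clean_vec, noisy_vec)
```

If the embeddings mainly encoded word-level meaning, the similarity between the clean and noisy representations would be expected to stay close to 1 when enough context is available.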


Thesis Distillation: Investigating The Impact of Bias in NLP Models on Hate Speech Detection

arXiv.org Artificial Intelligence

Hate speech on social media has severe negative impacts, not only on its victims (Sticca et al., 2013) but also on the moderators of social media platforms (Roberts, 2019). This is why it is crucial to develop tools for automated hate speech detection. These tools should provide a safer environment for individuals, especially for members of marginalized groups, to express themselves online. However, recent research shows that current hate speech detection models falsely flag content written by members of marginalized communities as hateful (Sap et al., 2019; Dixon et al., 2018; Mchangama et al., 2021). Similarly, recent research indicates that there are social biases in natural language processing (NLP) models (Garg et al., 2018; Nangia et al., 2020; Kurita et al., 2019; Ousidhoum et al., 2021; Nozza et al., 2021, 2022). Yet, the impact of these biases on the task of hate speech detection has been understudied. In my thesis, I identify and study three research problems: 1) the impact of bias in NLP models on the performance and explainability of hate speech [...]

Then, I address the identified research problems in hate speech detection, by investigating the impact of bias in NLP models on hate speech detection models from three perspectives: 1) the explainability perspective (§4), where I address the first research problem and investigate the impact of bias in NLP models on their performance of hate speech detection and whether the bias in NLP models explains their performance on hate speech detection; 2) the offensive stereotyping bias perspective (§5), where I address the second research problem and investigate the impact of imbalanced representations and co-occurrences of hateful content with marginalized identity groups on the bias of NLP models; and 3) the fairness perspective (§6), where I address the third research problem and investigate the impact of bias in NLP models on the fairness of the task of hate speech detection. For each research problem, I summarize the work done to highlight its main findings, contributions, and limitations. Thereafter, I discuss the general takeaways from my thesis and how it can benefit the NLP community at large (§7).


Semantic Change Detection for the Romanian Language

arXiv.org Artificial Intelligence

Language is in a continuous process of change. Language change is the phenomenon that drives language evolution, as a process of adaptation to the environment and to the ways other speakers use the language [3, 2]. The various instances of language change are classified into different categories, such as regular phonetic changes, changes in word usage, and changes in the way words appear together, i.e., syntactic changes. Although it is usually a continuous process that follows regular patterns, very abrupt changes in the meanings of words can still occur, usually motivated by a change in the context a community lives in [15, 9, 7]. Semantic change, as a phenomenon permanently present in language evolution, is an important aspect that should be taken into account when working with historical data [1]. Historical linguists, lexical typologists, and other humanities and social science experts have studied the meaning of words and how it changes over time.


Two Stage Contextual Word Filtering for Context bias in Unified Streaming and Non-streaming Transducer

arXiv.org Artificial Intelligence

It is difficult for an E2E ASR system to recognize words, such as entities, that appear infrequently in the training data. A widely used method to mitigate this issue is feeding contextual information into the acoustic model. Previous work has shown that a compact and accurate contextual list can boost performance significantly. In this paper, we propose an efficient approach to obtain a high-quality contextual list for a unified streaming/non-streaming based E2E model. Specifically, we first use the phone-level streaming output to filter the predefined contextual word list, then fuse the filtered list into the non-causal encoder and decoder to generate the final recognition results. Our approach improves the accuracy of the contextual ASR system and speeds up the inference process. Experiments on two datasets demonstrate over 20% character error rate (CER) reduction compared to the baseline system. Meanwhile, the real-time factor (RTF) of our system stays within 0.15 even when the contextual word list grows beyond 6,000 entries.
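For readers unfamiliar with the two metrics quoted above, the following sketch shows their standard definitions: CER as character-level edit distance divided by reference length, and RTF as processing time divided by audio duration. It assumes the reported reduction is relative to the baseline, which the abstract does not state, and all inputs are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Relative error reduction of a system over a baseline (assumed reading)."""
    return (baseline - system) / baseline


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means decoding runs faster than real time."""
    return processing_seconds / audio_seconds
```

For example, a hypothetical baseline CER of 0.10 and a system CER of 0.08 would correspond to a 20% relative reduction, and decoding 10 seconds of audio in 1.5 seconds gives an RTF of 0.15.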


Analyzing Vietnamese Legal Questions Using Deep Neural Networks with Biaffine Classifiers

arXiv.org Artificial Intelligence

In this paper, we propose using deep neural networks to extract important information from Vietnamese legal questions, a fundamental task towards building a question answering system in the legal domain. Given a legal question in natural language, the goal is to extract all the segments that contain the information needed to answer the question. We introduce a deep model that solves the task in three stages. First, our model leverages recent advanced autoencoding language models to produce contextual word embeddings, which are then combined with character-level and POS-tag information to form word representations. Next, bidirectional long short-term memory networks are employed to capture the relations among words and generate sentence-level representations. At the third stage, borrowing ideas from graph-based dependency parsing methods, which provide a global view of the input sentence, we use biaffine classifiers to estimate the probability that each pair of start and end words delimits an important segment. Experimental results on a public Vietnamese legal dataset show that our model outperforms the previous work by a large margin, achieving an F1 score of 94.79%. The results also prove the effectiveness of using contextual features extracted from pre-trained language models combined with other types of features, such as character-level and POS-tag features, when training on a limited dataset.
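The third stage can be illustrated with a generic biaffine scorer over pairs of token states. This is a minimal sketch of the general technique, not the paper's model: the score form (a bilinear term plus two linear terms and a bias, passed through a sigmoid), the toy token states, and the hand-set weights are all assumptions made for illustration.

```python
import math


def biaffine_score(h_start, h_end, U, w_start, w_end, b):
    """Score a (start, end) pair: h_start^T U h_end + w_start.h_start + w_end.h_end + b."""
    bilinear = sum(h_start[a] * U[a][c] * h_end[c]
                   for a in range(len(h_start))
                   for c in range(len(h_end)))
    linear = (sum(w * x for w, x in zip(w_start, h_start))
              + sum(w * x for w, x in zip(w_end, h_end)))
    return bilinear + linear + b


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def span_probabilities(states, U, w_start, w_end, b):
    """Probability for every candidate span (i, j) with i <= j."""
    n = len(states)
    return {(i, j): sigmoid(biaffine_score(states[i], states[j], U, w_start, w_end, b))
            for i in range(n) for j in range(i, n)}


# Toy 2-dimensional token states standing in for BiLSTM outputs (hypothetical).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[0.0, 2.0], [0.0, 0.0]]   # hand-set weights, not learned
probs = span_probabilities(states, U, [0.0, 0.0], [0.0, 0.0], -1.0)
```

Scoring all start-end pairs at once is what gives the classifier a global view of the sentence; spans whose probability exceeds a chosen threshold would then be kept as candidate segments.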


SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings

arXiv.org Artificial Intelligence

Adding interpretability to word embeddings represents an area of active research in text representation. Recent work has explored the potential of embedding words via so-called polar dimensions (e.g. good vs. bad, correct vs. wrong). Examples of such recent approaches include SemAxis, POLAR, FrameAxis, and BiImp. Although these approaches provide interpretable dimensions for words, they have not been designed to deal with polysemy, i.e. they cannot easily distinguish between different senses of words. To address this limitation, we present SensePOLAR, an extension of the original POLAR framework that enables word-sense aware interpretability for pre-trained contextual word embeddings. The resulting interpretable word embeddings achieve a level of performance that is comparable to original contextual word embeddings across a variety of natural language processing tasks including the GLUE and SQuAD benchmarks. Our work removes a fundamental limitation of existing approaches by offering users sense-aware interpretations for contextual word embeddings.
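The idea of a polar dimension can be sketched with a simplified, SemAxis-style projection: build an axis from two opposing pole vectors and measure how a word's embedding aligns with it. This is not the SensePOLAR transformation itself, and the two-dimensional vectors below are invented; in a sense-aware setting the pole vectors would presumably come from contextual embeddings of specific word senses.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def polar_score(word_vec, pos_pole, neg_pole):
    """Alignment of a word with the axis running from neg_pole to pos_pole.

    Positive values lean towards the positive pole, negative values towards
    the negative pole.
    """
    axis = [p - n for p, n in zip(pos_pole, neg_pole)]
    return cosine(word_vec, axis)


# Hypothetical toy embeddings for the poles "good" and "bad" and two query words.
good, bad = [1.0, 0.0], [-1.0, 0.0]
great, awful = [3.0, 4.0], [-3.0, 4.0]

great_score = polar_score(great, good, bad)
awful_score = polar_score(awful, good, bad)
```

With these toy vectors the first query word scores positively on the good-bad axis and the second negatively, which is the kind of reading a polar dimension is meant to make interpretable.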


Dual Attention Model for Citation Recommendation

arXiv.org Artificial Intelligence

With an exponentially increasing number of academic articles, discovering and citing comprehensive and appropriate resources has become a non-trivial task. Conventional citation recommender methods suffer from severe information loss. For example, they do not consider the section of the paper that the user is writing and for which they need to find a citation, the relatedness between the words in the local context (the text span that describes a citation), or the importance of each word in the local context. These shortcomings make such methods insufficient for recommending adequate citations to academic manuscripts. In this study, we propose a novel embedding-based neural network called "dual attention model for citation recommendation (DACR)" to recommend citations during manuscript preparation. Our method adapts the embedding of three dimensions of semantic information: words in the local context, structural contexts, and the section on which a user is working. A neural network is designed to maximize the similarity between the embeddings of the three inputs (local context words, section and structural contexts) and the target citation appearing in the context. The core of the neural network is composed of self-attention and additive attention, where the former aims to capture the relatedness between the contextual words and structural context, and the latter aims to learn their importance. The experiments on real-world datasets demonstrate the effectiveness of the proposed approach.